Feature/gene count input 20250212#58
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughThis PR adds external gene counts as a new input mode, extracts ReferenceHit into a standalone module, introduces a ChangesExternal Gene Counts Feature
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@gffquant/annotation/count_annotator.py`:
- Around line 333-347: annotate_external() currently always builds a 4-value
tuple (counts) and then adds it to self.gene_counts entries of length self.bins,
causing a shape mismatch when self.bins == 12 (strand-specific). Fix by adding a
guard and proper parsing: if self.bins != 4 then either (A) raise a clear error
(e.g., RuntimeError) explaining that strand-specific (12-bin) external matrices
are not supported yet, or (B) implement 12-bin parsing by reading the expected
per-strand columns (instead of
("uniq_raw","uniq_lnorm","combined_raw","combined_lnorm")) into a numpy array of
length 12 and then add that array to gcounts; update annotate_external(), the
local variables counts and gcounts, and the use of row[...] keys accordingly so
shapes always match.
In `@gffquant/counters/count_manager.py`:
- Line 3: Remove the unused top-level import of the csv module in
gffquant/counters/count_manager.py: delete the line importing "csv" (it's unused
in live code and only appears in commented stubs), leaving any commented
examples intact so Flake8 F401 is resolved.
In `@gffquant/handle_args.py`:
- Around line 42-47: The error message references the wrong attribute name
causing an AttributeError; update the f-string in the invalid-input check so it
uses args.gene_counts (the new option) instead of args.gene_count, keeping the
rest of the message intact and consistent with the tuple(map(bool, (has_fastq,
args.bam, args.sam, args.gene_counts))). This change ensures the ValueError is
raised with the correct variables: has_fastq, args.bam, args.sam,
args.gene_counts.
In `@gffquant/profilers/feature_quantifier.py`:
- Around line 331-341: The multi-line for-loop tuple has a hanging-indent style
that triggers E121/E125/E117; fix it by reformatting the tuple so elements are
either on the same line as the opening parenthesis or vertically aligned with
the first element (e.g., place the opening parenthesis immediately after the
assignment or align later lines under the first element), ensure the closing
parenthesis lines up with the start of the for statement, and keep the
logger.info call indented as the loop body referencing self.aln_counter and
metric/value; update the tuple in the for loop in feature_quantifier.py (the for
... in ( ... ) block that calls logger.info("%s: %s", metric,
self.aln_counter.get(value))) accordingly.
- Around line 343-346: In report_alignments(), the rate calculations using
self.aln_counter["full_read_count"] can raise ZeroDivisionError when
full_read_count is 0; modify the code that computes Alignment rate and Filter
pass rate to guard against division by zero by checking full_read_count first
(e.g., if full_read_count > 0 compute the percentage from
aln_counter["read_count"] / full_read_count and
aln_counter["filtered_read_count"] / full_read_count, otherwise set the rates to
0.0 or an appropriate sentinel like "0%"/"N/A") and then pass those safe values
into logger.info instead of performing the raw division directly.
- Around line 134-143: The external_gene_counts branch is calling
Annotator.annotate_external even when Annotator may be RegionCountAnnotator
(when self.run_mode.overlap_required is True), causing AttributeError; modify
the logic in feature_quantifier.py around the Annotator selection so that
annotate_external(...) is only invoked when the annotator is a
GeneCountAnnotator (e.g., guard with a check like not
self.run_mode.overlap_required or isinstance(Annotator(...), GeneCountAnnotator)
before calling annotate_external), and otherwise raise a clear error or fallback
to the annotate(...) path; update the block that sets total_readcount,
filtered_readcount, and has_ambig_counts accordingly so those values are only
set when annotate_external successfully runs.
- Around line 317-318: The JSON stats file is opened in binary mode which causes
json.dump(self.aln_counter, ...) to raise a TypeError; change the open call to
text mode (e.g. with open(f"{self.out_prefix}.aln_stats.json", "w",
encoding="utf-8") as aln_stats_out) so json.dump can write the string output to
aln_stats_out, leaving the variable names self.out_prefix, aln_stats_out and
self.aln_counter unchanged.
In `@gffquant/profilers/reference_hit.py`:
- Around line 23-29: The __eq__ method currently calls asdict() on other and
uses zip(sorted(...)) which can raise TypeError for non-dataclass objects and
silently truncate comparisons; update __eq__ to first check that other is an
instance of the same dataclass (use isinstance(other, type(self))) and if not
return NotImplemented, then compare the full sorted tuples of
asdict(self).items() and asdict(other).items() directly (avoid zip) so differing
field counts or values are detected correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d2109f96-b93f-47de-9f05-1479f1a790cf
📒 Files selected for processing (7)
gffquant/__init__.pygffquant/__main__.pygffquant/annotation/count_annotator.pygffquant/counters/count_manager.pygffquant/handle_args.pygffquant/profilers/feature_quantifier.pygffquant/profilers/reference_hit.py
| with open(fn, "rt", encoding="UTF-8") as _in: | ||
| for row in csv.DictReader(_in, delimiter="\t"): | ||
| # gene uniq_raw uniq_lnorm uniq_scaled combined_raw combined_lnorm combined_scaled | ||
| cols = row["uniq_raw"], row["uniq_lnorm"], row["combined_raw"], row["combined_lnorm"] | ||
| counts = tuple(map(float, cols)) | ||
| ref = row["gene"] | ||
|
|
||
| if gene_group_db: | ||
| ref_tokens = ref.split(".") | ||
| gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1] | ||
| else: | ||
| ggroup_id, gene_id = ref, ref | ||
|
|
||
| gcounts = self.gene_counts.setdefault(gene_id, np.zeros(self.bins)) | ||
| gcounts += counts |
There was a problem hiding this comment.
Reject or fully parse strand-specific external matrices.
annotate_external() always builds a 4-value counts tuple, but self.bins is 12 when --strand_specific is enabled. That makes gcounts += counts fail with a NumPy shape mismatch on the first imported row.
💡 Minimal guard if 12-bin import is not supported yet
def annotate_external(self, fn, db, gene_group_db=False): # refmgr, db, count_manager, gene_group_db=False):
+ if self.bins != 4:
+ raise ValueError(
+ "--gene_counts currently supports only non-strand-specific count matrices."
+ )
with open(fn, "rt", encoding="UTF-8") as _in:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with open(fn, "rt", encoding="UTF-8") as _in: | |
| for row in csv.DictReader(_in, delimiter="\t"): | |
| # gene uniq_raw uniq_lnorm uniq_scaled combined_raw combined_lnorm combined_scaled | |
| cols = row["uniq_raw"], row["uniq_lnorm"], row["combined_raw"], row["combined_lnorm"] | |
| counts = tuple(map(float, cols)) | |
| ref = row["gene"] | |
| if gene_group_db: | |
| ref_tokens = ref.split(".") | |
| gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1] | |
| else: | |
| ggroup_id, gene_id = ref, ref | |
| gcounts = self.gene_counts.setdefault(gene_id, np.zeros(self.bins)) | |
| gcounts += counts | |
| if self.bins != 4: | |
| raise ValueError( | |
| "--gene_counts currently supports only non-strand-specific count matrices." | |
| ) | |
| with open(fn, "rt", encoding="UTF-8") as _in: | |
| for row in csv.DictReader(_in, delimiter="\t"): | |
| # gene uniq_raw uniq_lnorm uniq_scaled combined_raw combined_lnorm combined_scaled | |
| cols = row["uniq_raw"], row["uniq_lnorm"], row["combined_raw"], row["combined_lnorm"] | |
| counts = tuple(map(float, cols)) | |
| ref = row["gene"] | |
| if gene_group_db: | |
| ref_tokens = ref.split(".") | |
| gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1] | |
| else: | |
| ggroup_id, gene_id = ref, ref | |
| gcounts = self.gene_counts.setdefault(gene_id, np.zeros(self.bins)) | |
| gcounts += counts |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@gffquant/annotation/count_annotator.py` around lines 333 - 347,
annotate_external() currently always builds a 4-value tuple (counts) and then
adds it to self.gene_counts entries of length self.bins, causing a shape
mismatch when self.bins == 12 (strand-specific). Fix by adding a guard and
proper parsing: if self.bins != 4 then either (A) raise a clear error (e.g.,
RuntimeError) explaining that strand-specific (12-bin) external matrices are not
supported yet, or (B) implement 12-bin parsing by reading the expected
per-strand columns (instead of
("uniq_raw","uniq_lnorm","combined_raw","combined_lnorm")) into a numpy array of
length 12 and then add that array to gcounts; update annotate_external(), the
local variables counts and gcounts, and the use of row[...] keys accordingly so
shapes always match.
| def __eq__(self, other): | ||
| return all( | ||
| item[0][1] == item[1][1] | ||
| for item in zip( | ||
| sorted(asdict(self).items()), | ||
| sorted(asdict(other).items()) | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In Python, should eqreturnNotImplementedfor unrelated types, and doesdataclasses.asdict()raiseTypeError when passed a non-dataclass instance?
💡 Result:
For eq: Yes—when your eq can’t compare meaningfully to the other object’s type, it’s recommended to return NotImplemented (rather than False). NotImplemented is the special value for rich-binary operators (including eq) to signal “not implemented with respect to the other type”; when it’s returned, Python will try the reflected operation on the other operand, and if all attempts return NotImplemented, Python falls back (for ==/!=) or raises an appropriate exception depending on the operator. Also note: NotImplemented is a distinct singleton constant (not truthy), and incorrect use can lead to misleading errors or the value being returned to user code [1]. For dataclasses.asdict(): Yes—dataclasses.asdict(obj) raises TypeError when obj is not a dataclass instance. The docs state: “asdict raises TypeError if obj is not a dataclass instance.” [2] and CPython’s implementation checks _is_dataclass_instance(obj) and raises TypeError("asdict() should be called on dataclass instances") [3].
Citations:
- 1: https://docs.python.org/3/library/constants.html?highlight=notimplemented
- 2: https://docs.python.org/3/library/dataclasses.html
- 3: https://github.com/python/cpython/blob/3.11/Lib/dataclasses.py
🏁 Script executed:
cat -n gffquant/profilers/reference_hit.py | head -40Repository: cschu/gff_quantifier
Length of output: 1143
Type-check other before calling asdict().
This __eq__ implementation will raise TypeError when comparing against non-dataclass objects instead of returning NotImplemented. Additionally, using zip() can silently truncate if other has fewer fields than self, causing unequal dataclasses to compare as equal. Add an isinstance check and use direct tuple comparison instead.
🔧 Suggested fix
def __eq__(self, other):
- return all(
- item[0][1] == item[1][1]
- for item in zip(
- sorted(asdict(self).items()),
- sorted(asdict(other).items())
- )
- )
+ if not isinstance(other, ReferenceHit):
+ return NotImplemented
+ return tuple(asdict(self).values()) == tuple(asdict(other).values())🧰 Tools
🪛 Ruff (0.15.12)
[warning] 26-29: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@gffquant/profilers/reference_hit.py` around lines 23 - 29, The __eq__ method
currently calls asdict() on other and uses zip(sorted(...)) which can raise
TypeError for non-dataclass objects and silently truncate comparisons; update
__eq__ to first check that other is an instance of the same dataclass (use
isinstance(other, type(self))) and if not return NotImplemented, then compare
the full sorted tuples of asdict(self).items() and asdict(other).items()
directly (avoid zip) so differing field counts or values are detected correctly.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
gffquant/handle_args.py (1)
30-31: ⚖️ Poor tradeoffConsider validating reference/aligner only when needed.
Currently, the reference index validation runs before determining the input type (line 49) and before checking whether
--reference/--alignerare valid for the chosen input (lines 51-54). If a user mistakenly provides--gene_countswith--aligner bwaand no--reference, they'll see "Cannot find reference index atNone" instead of the more helpful message at line 52 that these options aren't needed.♻️ Possible refactor
Move lines 30-31 after line 54, or guard them with
if has_fastq:, so the reference check only runs when those options are actually required:if has_fastq: if (args.aligner == "bwa" and not check_bwa_index(args.reference)) or \ (args.aligner == "minimap" and not check_minimap2_index(args.reference)): raise ValueError(f"Cannot find reference index at `{args.reference}`.")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gffquant/handle_args.py` around lines 30 - 31, The reference-index validation currently runs unconditionally and can raise confusing errors when the input mode doesn't require a reference; move the conditional that calls check_bwa_index/check_minimap2_index (the block using args.aligner and args.reference) so it only runs when FASTQ-based processing is required (e.g., after the input-type detection/has_fastq determination) or wrap it with if has_fastq: to ensure the checks only occur when the code will use aligner-based processing; keep the same checks and error message but relocate them after the logic that validates --gene_counts vs FASTQ inputs (the code around determining has_fastq and the lines that validate --reference/--aligner usage).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@gffquant/handle_args.py`:
- Around line 51-52: The error message raised when (args.reference or
args.aligner) and not has_fastq is misleading because the same branch also
applies when the user provided --gene_counts; update the ValueError text in
handle_args.py to mention that --reference/--aligner are not needed with
alignment input (bam, sam) or when using --gene_counts so users see the complete
reason; locate the conditional that checks (args.reference or args.aligner) and
not has_fastq and change the exception message to include "or --gene_counts".
---
Nitpick comments:
In `@gffquant/handle_args.py`:
- Around line 30-31: The reference-index validation currently runs
unconditionally and can raise confusing errors when the input mode doesn't
require a reference; move the conditional that calls
check_bwa_index/check_minimap2_index (the block using args.aligner and
args.reference) so it only runs when FASTQ-based processing is required (e.g.,
after the input-type detection/has_fastq determination) or wrap it with if
has_fastq: to ensure the checks only occur when the code will use aligner-based
processing; keep the same checks and error message but relocate them after the
logic that validates --gene_counts vs FASTQ inputs (the code around determining
has_fastq and the lines that validate --reference/--aligner usage).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f00e31f7-3c7e-4a15-9734-0e4f74fc01e7
📒 Files selected for processing (5)
gffquant/annotation/count_annotator.pygffquant/counters/count_manager.pygffquant/handle_args.pygffquant/profilers/feature_quantifier.pygffquant/profilers/reference_hit.py
✅ Files skipped from review due to trivial changes (1)
- gffquant/counters/count_manager.py
🚧 Files skipped from review as they are similar to previous changes (3)
- gffquant/profilers/reference_hit.py
- gffquant/annotation/count_annotator.py
- gffquant/profilers/feature_quantifier.py
Summary by CodeRabbit
New Features
Bug Fixes / Validation
Chores